by Samit Chaudhuri
This project explores relationship between quality and chemical properties of wines. Wine quality is based on subjective ratings by human experts. Chemical properties are determined by objective measurements. Two tidy data sets of 1599 red wines and 4898 white wines, each with 11 input variables and 1 output variables are used.
Additionally some models are presented to predict the quality of wines from their chemical characteristics.
setwd("~/personal/repos/github/analyze-wines/")
redWines <- read.csv("wineQualityReds.csv")
redWines <- subset(redWines, select = -c(X))
str(redWines)
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
library(ggplot2)
library(GGally)
## Warning: replacing previous import by 'grid::arrow' when loading 'GGally'
## Warning: replacing previous import by 'grid::unit' when loading 'GGally'
library(scales)
library(memisc)
## Loading required package: lattice
## Loading required package: MASS
##
## Attaching package: 'memisc'
##
## The following object is masked from 'package:scales':
##
## percent
##
## The following objects are masked from 'package:stats':
##
## contr.sum, contr.treatment, contrasts
##
## The following objects are masked from 'package:base':
##
## as.array, trimws
library(gridExtra)
qualityLevel = function(quality) {
level = as.numeric(cut(quality, breaks=c(0, 4, 5, 6, 10)))
level
return(level)
}
rpalette = 5
wpalette = 1
ptfont = 10 # font size of plot title
atfont = 9 # font size of axis title
ltfont = 8 # font size of legend title
The red wine data has one dependent variable, quality, that gives each wine a score between 0 and 10. There are 11 independent variables, some of with may be correlated.
Let us first find the distribution wine quality in the sample data set. Then we use the basic histogram to categorize quality scores into 4 levels: poor, good, fine, and excellent. This new factor variable, quality.label, is computed by cutting the quality number into 4 ranges (0,3], (0,5], (5,6], (6,10]. Now we can add some color to the previous histogram.
ptfont = 10 # font size of plot title
atfont = 9 # font size of axis title
ltfont = 8 # font size of legend title
rqhist1 <- ggplot(data=redWines, aes(x=quality)) +
ggtitle("Red Wine Quality") + # Plot title text
xlab("Wine Quality") + ylab("Count") + # Axis label text
theme_bw() + # Theme elements
theme(title=element_text(size=ptfont, face="bold"), # Title font
axis.title=element_text(size=atfont, face="bold")) +
scale_x_continuous(limits=c(3,9), breaks = seq(3, 9, 1)) +
geom_histogram(binwidth = 1)
redWines$quality.label = cut(redWines$quality, breaks=c(0, 4, 5, 6, 10),
labels=c("poor", "average", "fine", "excellent"))
rqhist2 <- ggplot(data=redWines, aes(x=quality, fill=quality.label)) +
ggtitle("Red Wine Categories") + # Plot title text
xlab("Wine Quality") + ylab("Count") + # Axis label text
scale_fill_brewer(type = 'div', palette = rpalette, # Legend title/color
guide=guide_legend(reverse=TRUE, title="Quality Rating")) +
theme_bw() + # Theme elements
theme(title=element_text(size=ptfont, face="bold"), # Title font
axis.title=element_text(size=atfont, face="bold"), # Axis font
legend.title=element_blank(),
legend.position=c(.9,.7)) + # Legend position
scale_x_continuous(limits=c(3,9)) +
geom_histogram(binwidth = 1)
grid.arrange(rqhist1, rqhist2, ncol=2)
## Warning: Removed 2 rows containing missing values (geom_bar).
## Warning: Removed 8 rows containing missing values (geom_bar).
So we have a decent categorization of excellent, fine, good, and poor red wines.
In the beginning, we don’t really know what causes a wine to be excellent or poor. So we want the entire data set speak for itself by plotting many variables all at once. Hopefully the plots will draw our attention to a smaller set of variables and relationships.
#ggpairs(redWines, params = c(color=I('red3'), shape = I('.'), corSize = 8, outlier.shape = I('.')), title = "Red Wines Scatter Plot Matrix") + theme(axis.title.x=element_text(size=8,angle=45, vjust=1), axis.title.y=element_text(size=8,angle=45, hjust=1), axis.text=element_blank())
#ggpairs(redWines, params = c(color=I('red3'), shape = I('.'), corSize = 8, outlier.shape = I('.')), title = "Red Wines Scatter Plot Matrix") + theme(size=8, axis.text=element_blank())
ggpairs(redWines, params = c(color=I('red3')), title = "Red Wines Scatter Plot Matrix") + theme(axis.text=element_blank())
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Some of the dependent variables are highly correlated. For example, citric acid is correlated with both fixed and volatile acidity. Density is correlated with both fixed acidity and alcohol content.
The quality score is most correlated with alcohol content and volatile acidity, but it does not correlate much with anything else. On the other hand, the box plots of the quality label (excellent, fine, good, poor) show a dependency on fix and volatile acidity; excellent wines have higher mean fixed acidity and lower mean volatile acidity.
We will closely examine the effects of alcohol and acidity (both fixed and volatile) on quality score and quality label with grouped density plots.
ptfont = 10 # font size of plot title
atfont = 9 # font size of axis title
ltfont = 8 # font size of legend title
alDensity <- ggplot(data=redWines, aes(x=alcohol, color=quality.label)) +
ggtitle("Quality vs Alcohol") + # Plot title text
xlab("Alcohol Content") + ylab("Count") +
scale_color_brewer(type = 'div', palette = rpalette, # Legend title/color
guide=guide_legend(reverse=TRUE)) +
theme_bw() + # Theme elements
theme(title=element_text(size=ptfont, face="bold"), # Title font
axis.title=element_text(size=atfont, face="bold"), # Axis font
legend.title=element_blank(),
legend.position=c(.85,.6)) +
geom_density()
acidDensity <- ggplot(data=redWines) +
ylab("Count") +
scale_color_brewer(type = 'div', palette = rpalette, # Legend title/color
guide=guide_legend(reverse=TRUE)) +
theme_bw() + # Theme elements
theme(title=element_text(size=ptfont, face="bold"), # Title font
axis.title=element_text(face="bold"), # Axis font
legend.title=element_blank(),
legend.position=c(.85,.6))
vaDensity <- acidDensity +
geom_density(aes(x=volatile.acidity, color=quality.label)) +
ggtitle("Quality vs Volatile Acidity") + # Plot title
xlab("Volatile Acidity")
faDensity <- acidDensity +
geom_density(aes(x=fixed.acidity, color = quality.label)) +
ggtitle("Quality vs Fixed Acidity") + # Plot title
xlab("Fixed Acidity")
grid.arrange(alDensity, vaDensity, faDensity, nrow=2, ncol=2)
The density plots of different quality levels are shifted with respect to each other. The alcohol density plots shift toward right with improving quality. Better wines tend of have higher alcohol. The fixed acidity plots also shift toward right with improving quality, while volatile acidity density plots shift toward left with improving quality. Better wines tend of have higher fixed acidity and lower volatile acidity.
These trends can also be observed with box plots.
ptfont = 10 # font size of plot title
atfont = 9 # font size of axis title
ltfont = 8 # font size of legend title
albox <- ggplot(data=redWines, aes(x=quality.label, y=alcohol,
color=quality.label)) +
ggtitle("Alcohol vs Quality") + # Plot title text
xlab("Qality Label") + ylab("Alcohol Stats") + # Axis label text
scale_color_brewer(type = 'div', palette = rpalette) +# Legend title/color
theme_bw() + # Theme elements
theme(title=element_text(size=ptfont, face="bold"), # Title font
axis.title=element_text(size=atfont, face="bold"), # Axis font
legend.title=element_text(size=ltfont, face="bold"), # Legend font
legend.position="none") + # Legend position
geom_boxplot()
abox <- ggplot(data=redWines, aes(x=quality.label, color=quality.label)) +
# Plot title text
xlab("Qality Label") + # Axis label text
scale_color_brewer(type = 'div', palette = rpalette)+ # Legend color
theme_bw() + # Theme elements
theme(title=element_text(size=ptfont, face="bold"), # Title font
axis.title=element_text(size=atfont, face="bold"), # Axis font
legend.title=element_text(size=ltfont, face="bold"), # Legend font
legend.position="none") # Legend position
par(mfrow=c(1,2))
vabox <- abox + geom_boxplot(aes(y=volatile.acidity)) +
ggtitle("Volatile Acidity vs Quality") +
ylab("Volatile Acidity Stats")
fabox <- abox + geom_boxplot(aes(y=fixed.acidity)) +
ggtitle("Fixed Acidity vs Quality") +
ylab("Fixed Acidity Stats")
grid.arrange(albox, vabox, fabox, nrow=2, ncol=2)
Now let’s observe the joint effect of volatile acidity and alcohol content on quality with a scatter plot. Side-by-side, we also plot the joint effect of fixed acidity and alcohol content.
alascatter <- ggplot(data=redWines, aes(x=alcohol,
group=quality.label, fill=quality.label, color=quality.label)) +
scale_x_continuous(limits=c(8, 15), breaks = seq(8, 15, .5)) +
xlab("Alcohol Percentage") + # Axis label text
scale_color_brewer(type = 'div', palette = rpalette,
guide=guide_legend(reverse=TRUE)) + # Legend color
scale_fill_brewer(guide = FALSE) +
theme_bw() + # Theme elements
theme(title=element_text(size=ptfont, face="bold"), # Title font
axis.title=element_text(size=atfont, face="bold"), # Axis font
legend.title=element_blank(), # Legend font
legend.position=c(.85,.6)) # Legend position
alvascatter <- alascatter +
ggtitle("Volatile Acidity and Alcohol Content of Red Wines") +
ylab("Volatile Acidity") +
scale_y_continuous(limits=c(0.1, 1.6), breaks = seq(0.1, 1.6, .1)) +
geom_point(aes(y=volatile.acidity, size=3))
alfascatter <- alascatter +
ggtitle("Fixed Acidity and Alcohol Content of Red Wines") +
ylab("Fixed Acidity") +
scale_y_continuous(limits=c(3, 16), breaks = seq(3, 16, 1)) +
geom_point(aes(y=fixed.acidity, size=3))
alvascatter
alfascatter
Red wine quality is definitely good when alcohol content is above 12 and volatile acidity is below 0.5. Even when alcohol content is not quite as high, one can find good wines as with low volatile acidity, or with high fixed acidity. Similarly when volatile acidity is not quite as low, or fixed acidity is not quite as high, one can find good wines with high alcohol content. In other words, as long has either alcohol content is above 12, or volatile acidity is below 8 or fixed acidity is above 8, there is a good chance that the wine has good quality.
First we start with a full linear regression model.
fulllm <- lm(data=redWines, quality ~ . - quality.label)
summary(fulllm)
##
## Call:
## lm(formula = quality ~ . - quality.label, data = redWines)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.68911 -0.36652 -0.04699 0.45202 2.02498
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.197e+01 2.119e+01 1.036 0.3002
## fixed.acidity 2.499e-02 2.595e-02 0.963 0.3357
## volatile.acidity -1.084e+00 1.211e-01 -8.948 < 2e-16 ***
## citric.acid -1.826e-01 1.472e-01 -1.240 0.2150
## residual.sugar 1.633e-02 1.500e-02 1.089 0.2765
## chlorides -1.874e+00 4.193e-01 -4.470 8.37e-06 ***
## free.sulfur.dioxide 4.361e-03 2.171e-03 2.009 0.0447 *
## total.sulfur.dioxide -3.265e-03 7.287e-04 -4.480 8.00e-06 ***
## density -1.788e+01 2.163e+01 -0.827 0.4086
## pH -4.137e-01 1.916e-01 -2.159 0.0310 *
## sulphates 9.163e-01 1.143e-01 8.014 2.13e-15 ***
## alcohol 2.762e-01 2.648e-02 10.429 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.648 on 1587 degrees of freedom
## Multiple R-squared: 0.3606, Adjusted R-squared: 0.3561
## F-statistic: 81.35 on 11 and 1587 DF, p-value: < 2.2e-16
Now we select only the signficant variables, and then leave out collinear variables, citric.acid, pH, and density because they are highly correlated with acidity.
redWines$quality = as.numeric(redWines$quality)
reducedlm <- lm(data=redWines, quality ~ .
- citric.acid - density - pH - quality.label)
summary(reducedlm)
##
## Call:
## lm(formula = quality ~ . - citric.acid - density - pH - quality.label,
## data = redWines)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.7772 -0.3728 -0.0605 0.4520 2.0213
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.7099270 0.2332016 11.621 < 2e-16 ***
## fixed.acidity 0.0251103 0.0101305 2.479 0.0133 *
## volatile.acidity -1.0698952 0.1000046 -10.698 < 2e-16 ***
## residual.sugar 0.0072856 0.0120570 0.604 0.5458
## chlorides -1.7451754 0.3922647 -4.449 9.23e-06 ***
## free.sulfur.dioxide 0.0041719 0.0021259 1.962 0.0499 *
## total.sulfur.dioxide -0.0031126 0.0006882 -4.523 6.56e-06 ***
## sulphates 0.8832980 0.1110258 7.956 3.35e-15 ***
## alcohol 0.2794893 0.0167556 16.680 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6499 on 1590 degrees of freedom
## Multiple R-squared: 0.3556, Adjusted R-squared: 0.3524
## F-statistic: 109.7 on 8 and 1590 DF, p-value: < 2.2e-16
The R-squared of the reduced model is not much different from the full model. The R-squared did not really improve with models built with other combinations of feature variables.
One possible reason that linear regression model of red-wine quality has low accuracy is that quality has a non-linear relation with other dependent variables. This prompts us to explore the relationship with a multi-class classification trees. Before building a CART model, we will split the data into a training set and a test set. CART model built on the training set will be used to predict quality on the test set. Then we will be able to measure the accuracy of the predictive capability of the model.
library(rpart)
library(rpart.plot)
library(caret)
library(e1071)
redWines$quality = factor(redWines$quality)
# Split data into a train and test set
index <- 1:nrow(redWines)
set.seed(144)
testSamples <- sample(index, trunc(length(index)/3))
rqindex <- which(colnames(redWines)=="quality")
rqlindex <- which(colnames(redWines)=="quality.label")
redTest <- redWines[testSamples,-rqlindex]
redTrain <- redWines[-testSamples,-rqlindex]
# Create rpart model with cross validation
rtr.control = trainControl(method = "cv", number = 5)
cp.grid = expand.grid( .cp = (0:4)*0.001)
set.seed(144)
rrpart.grid = train(data = redTrain, quality ~ ., method = "rpart",
trControl = rtr.control, tuneGrid = cp.grid)
rrpart.model = rrpart.grid$finalModel
prp(rrpart.model, tweak=1.5)
#Compute confusion matrix
rrpart.cpred <- predict(rrpart.model, newdata=redTest, type = "class" )
rrpart.cmatrix <- table(pred = rrpart.cpred, true = redTest[,rqindex])
rrpart.accuracy <- sum(diag(rrpart.cmatrix))/sum(rrpart.cmatrix)
rrpart.cmatrix
## true
## pred 3 4 5 6 7 8
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
## 5 1 15 166 62 4 0
## 6 1 7 61 119 38 2
## 7 0 1 10 21 22 3
## 8 0 0 0 0 0 0
rrpart.accuracy
## [1] 0.575985
Indeed, the CART model can predict quality on the test set with 58% accuracy.
We now perform EDA on white wine quality and see if they follow a similar pattern as red wines.
whiteWines <- read.csv("wineQualityWhites.csv")
whiteWines <- subset(whiteWines, select = -c(X))
str(whiteWines)
## 'data.frame': 4898 obs. of 12 variables:
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
whiteWines$quality.label = cut(whiteWines$quality, breaks=c(0, 4, 5, 6, 10),
labels=c("poor", "average", "fine", "excellent"))
wqhist1 <- ggplot(data=whiteWines, aes(x=quality)) +
ggtitle("White Wine Quality") + # Plot title text
xlab("Wine Quality") + ylab("Count") + # Axis label text
theme_bw() + # Theme elements
theme(title=element_text(size=ptfont, face="bold"), # Title font
axis.title=element_text(size=atfont, face="bold")) +
scale_x_continuous(limits=c(3,9), breaks = seq(3, 9, 1)) +
geom_histogram(binwidth = 1)
wqhist2 <- ggplot(data=whiteWines, aes(x=quality, fill=quality.label)) +
ggtitle("White Wine Categories") + # Plot title text
xlab("Wine Quality") + ylab("Count") + # Axis label text
scale_fill_brewer(type = 'div', palette = wpalette,
guide=guide_legend(reverse=TRUE, title="Quality Rating")) +
theme_bw() + # Theme elements
theme(title=element_text(size=ptfont, face="bold"), # Title font
axis.title=element_text(size=atfont, face="bold"), # Axis font
legend.title=element_text(size=ltfont, face="bold"), # Legend font
legend.position=c(.9,.7)) + # Legend position
scale_x_continuous(limits=c(3,9)) +
geom_histogram(binwidth = 1)
grid.arrange(wqhist1, wqhist2, ncol=2)
## Warning: Removed 2 rows containing missing values (geom_bar).
## Warning: Removed 8 rows containing missing values (geom_bar).
Now plot all vriable pairs in the data set to identify noticable relationships.
#ggpairs(whiteWines,
# params = c(color=I('green3'), shape = I('.'), corSize=8,
# outlier.shape = I('.')),
# title = "White Wines Scatter Plot Matrix")
Similar to quality or red wines, quality of white wines also relate to alcohol content and acidity. We can examine these relations with scatterplots.
alascatter <- ggplot(data=whiteWines, aes(x=alcohol,
group=quality.label, fill=quality.label, color=quality.label)) +
scale_x_continuous(limits=c(8, 15), breaks = seq(8, 15, .5)) +
xlab("Alcohol Percentage") + # Axis label text
scale_color_brewer(type = 'div', palette = wpalette, # Legend color
guide=guide_legend(reverse=TRUE, title="Quality Rating")) +
scale_fill_brewer(guide = FALSE) +
theme_bw() + # Theme elements
theme(title=element_text(size=ptfont, face="bold"), # Title font
axis.title=element_text(size=atfont, face="bold"), # Axis font
legend.title=element_text(size=ltfont, face="bold"), # Legend font
legend.position=c(.85, .6)) # Legend position
alvascatter <- alascatter +
ylab("Volatile Acidity") +
scale_y_continuous(limits=c(0.08, 1.1), breaks = seq(0.08, 1.1, .1)) +
ggtitle("Volatile Acidity and Alcohol Content of White Wines") +
geom_point(aes(y=volatile.acidity), size=3)
alfascatter <- alascatter +
ylab("Fixed Acidity") +
scale_y_continuous(limits=c(3, 16), breaks = seq(3, 16, 1)) +
ggtitle("Fixed Acidity and Alcohol Content of White Wines") +
geom_point(aes(y=fixed.acidity), size=3)
alvascatter
alfascatter
We can also create a classification tree for prediction. The model can predict quality on the test set with 52% accuracy. The CART is drawn below.
library(rpart)
library(rpart.plot)
library(caret)
library(e1071)
whiteWines$quality = factor(whiteWines$quality)
# Split data into a train and test set
index <- 1:nrow(whiteWines)
set.seed(144)
wtestSamples <- sample(index, trunc(length(index)/3))
wqindex <- which(colnames(whiteWines)=="quality")
wqlindex <- which(colnames(whiteWines)=="quality.label")
whiteTest <- whiteWines[wtestSamples,-wqlindex]
whiteTrain <- whiteWines[-wtestSamples,-wqlindex]
# Create rpart model with cross validation
wtr.control = trainControl(method = "cv", number = 5)
wcp.grid = expand.grid( .cp = (0:4)*0.001)
set.seed(144)
wrpart.grid = train(data = whiteTrain, quality ~ . - citric.acid - density
- pH , method = "rpart", trControl = wtr.control, tuneGrid = cp.grid)
wrpart.model = wrpart.grid$finalModel
prp(wrpart.model)
#Compute confusion matrix
wrpart.cpred <- predict(wrpart.model, whiteTest, type = "class" )
wrpart.cmatrix <- table(pred = wrpart.cpred, true = whiteTest[,wqindex])
wrpart.accuracy <- sum(diag(wrpart.cmatrix))/sum(wrpart.cmatrix)
wrpart.cmatrix
## true
## pred 3 4 5 6 7 8 9
## 3 0 0 0 0 0 0 0
## 4 0 0 1 1 0 0 0
## 5 2 34 272 154 15 0 0
## 6 5 25 199 542 245 49 0
## 7 1 2 5 27 44 9 0
## 8 0 0 0 0 0 0 0
## 9 0 0 0 0 0 0 0
wrpart.accuracy
## [1] 0.5257353
```
More alcohol and higher quality usually go together. In addition, volatile acidity degrades quality, probably because it induces a vinegary taste in the wine. Residual sugar levels are somewhat important probably because sweetness is often associated with freshness. Sulphates are important too, possibly because of its link to fermentation.
This report explored the relationship betwen chemical properties of red and white wines, and their quality as identified by wine experts. Several plots including histograms, density plots, box plots, and scatter plots were used in the exploration to identify three major featurs impacting wine quality. They are alcohol content, fixed acidity, and volatile acidity.
This section reproduces a few of those plots. For example, the density plots of red wine quality over the three features are shown below. Detailed code for generating these plots have already been included in earlier sections.
grid.arrange(alDensity, vaDensity, faDensity, nrow=2, ncol=2)
For white wines, the relationsip between quality, alcohol content and volatile acidity can be seen in the following scatter plot.
alvascatter
We also built regression and classification models to predict wine quality from its chemical properties. The classification trees identified two additional significant features, namely residual sugar and sulphates. In the following we show the classification-tree for white wines. Detailed code for generating the model have already been included in an earlier section.
prp(wrpart.model)
In this report, we have explored the relationship betwen chemical properties of wines, such as fixed and volatile acidity, residual sugar, sulphates, alcohol content, and the quality of the wines as identified by the experts. We started with 11 different independent input variables, and showed that one can pick fine or excellent wines by focusing on 5 or fewer primary contributors. However, our prdictive models are not accurate enough (58% for red and 52 % for white). Despite providing interesting insights, they need to more refined before providing reliable wine ratings. In addition to linear regression and CART models used in this project, there exists other advanced techniques for extracting features and building and training models that can be used to create a better predictive model for wine quality.